Edge AI / On-Device AI

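Apple announced Apple Intelligence in 2024, and in 2025 Gemini Nano arrived in Chrome, so running an LLM natively in the browser became practical. AI is moving from "cloud only" to a hybrid of cloud and on-device. This article covers how Edge AI is engineered, where it does and does not fit, and how it cooperates with the cloud.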

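Before you start

In 2023, Edge AI was still mostly an academic topic: the models were too large to run on consumer devices.

Between 2024 and 2026, three breakthroughs made Edge AI an engineering option:

Small models are good enough: Llama 3.2 1B/3B, Qwen3 0.6B, and Gemini Nano come close to cloud models on many narrow tasks
Hardware acceleration is everywhere: Apple Neural Engine, NVIDIA Tensor Cores, and Snapdragon NPUs ship in consumer devices
Runtimes have matured: WebGPU, ONNX Runtime Web, llama.cpp, and MLX make it feasible to run a model on almost any device

Why front-end engineers should care:

Privacy: user data never leaves the device
Latency: local inference with a small model can return in under 100 ms, versus roughly 500-2000 ms for a cloud round trip
Cost: inference runs on the user's own computer or phone, so you pay nothing per request
Offline: it keeps working on a plane or in the subway
Scale: cloud inference cost grows linearly with users; on-device cost does not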

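Learning objectives

Distinguish the four Edge AI deployment targets (browser, mobile app, desktop, IoT)
Run an LLM in the browser with WebGPU + Transformers.js
Run task-specific models with ONNX Runtime Web / TensorFlow.js
Integrate Apple Intelligence / Gemini Nano (system-level APIs)
Design hybrid cloud/local routing (which tasks run locally, which go to the cloud)
Assess where Edge AI fits (which scenarios work and which do not)

How this connects to earlier material

6-5 Private Deployment of Open-Source Models: server-side deployment (prerequisite)
17 Multi-Model Routing: routing decisions between local and cloud
12 Streaming Engineering in Depth: local models need streaming too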

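Chapter 1: Where Edge AI really fits

1.1 Three kinds of "Edge"

Do not confuse them:

Edge type	Meaning	Example
Edge Compute	Running code on CDN nodes	Cloudflare Workers, Vercel Edge
Edge AI	Running small models on CDN nodes	Workers AI
On-Device AI (the focus of this article)	Running on the user's device	An iPhone running Apple Intelligence

The last two are both called "Edge", but the engineering is very different.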

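1.2 Why Edge AI takes off in 2026

Small models: Llama 3.2 1B is about 2.5 GB in fp16 and under 1 GB at 4-bit
Capable devices: iPhone 17 Pro Neural Engine at roughly 38 TOPS
Browsers: WebGPU is standardized
System level: Apple Intelligence, Gemini Nano, Windows Copilot Runtime

With all four in place, Edge AI becomes feasible to engineer.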

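1.3 Good fit vs. poor fit for Edge

Good fit for Edge	Better in the cloud
Privacy-sensitive data (health, banking)	General knowledge queries
Low latency (autocomplete, real-time translation)	Complex reasoning
Offline scenarios	Long context (>100K tokens)
High-frequency simple tasks	High-quality multimodal generation
Large user base (cloud inference cost explodes)	Compute-heavy work (video generation)
Personalization (the user's private data)	Tasks that need up-to-date knowledge

Key takeaway: Edge AI does not replace the cloud; it works alongside it.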

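Chapter 2: Four deployment targets

2.1 Browser (widest reach)

Tech stack:

WebGPU: hardware acceleration (in Chromium-based browsers since 2023; Safari and Firefox support arrived later and is still uneven)
Transformers.js: Hugging Face's JavaScript library
ONNX Runtime Web: a cross-model runtime
WebNN: lets the browser call the NPU (experimental, from 2025)

Good fit:

Any web application that wants to add AI
Privacy-sensitive scenarios
Interactions that need very low latency

Poor fit:

Old user devices (computers more than five years old)
iOS Safari, which imposes some restrictions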

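2.2 Mobile app

Tech stack:

iOS: Core ML, MLX, Apple Intelligence API
Android: ML Kit, TensorFlow Lite, Gemini Nano API
Cross-platform: ONNX Runtime Mobile, PyTorch Mobile

Good fit:

Features users rely on every day
Features that must work offline
Personalized recommendations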

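2.3 Desktop application

Tech stack:

macOS: MLX (Apple's own framework), llama.cpp
Windows: Windows Copilot Runtime, DirectML
Linux: llama.cpp, ONNX Runtime
Cross-platform: Ollama (the simplest)

Good fit:

Creative tools (Photoshop, video editing)
Programming tools (a local Coding Agent)
Privacy tools (notes, password managers)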

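2.4 IoT / embedded

Tech stack:

NVIDIA Jetson: a full GPU toolchain
Coral TPU: Google's edge AI accelerator
Quantized models: TFLite Micro

Good fit:

Industrial cameras
Robots
Smart home devices

This article focuses on the first three targets.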

Chapter 3: Edge AI in the browser, hands-on

3.1 Running a model with Transformers.js

The simplest way to start:

npm install @huggingface/transformers

import { pipeline } from '@huggingface/transformers';

// Load the model (downloaded on first use, then served from the browser cache)
const classifier = await pipeline(
  'sentiment-analysis',
  'Xenova/distilbert-base-uncased-finetuned-sst-2-english',
  { device: 'webgpu' }  // accelerate with WebGPU
);

// Run inference
const result = await classifier('I love this product!');
// [{ label: 'POSITIVE', score: 0.999 }]

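Model storage:

The first download is stored through the browser's Cache API (persistent, though the browser may evict it under storage pressure)
Later loads read from the cache and take seconds
The cached files are shared by all tabs on the same origin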

3.2 Running a generative LLM

import { pipeline } from '@huggingface/transformers';

const generator = await pipeline(
  'text-generation',
  'Xenova/Phi-3-mini-4k-instruct',  // Microsoft's 3.8B small model
  { device: 'webgpu', dtype: 'q4' }   // 4-bit quantization
);

const result = await generator(
  'What is 2+2?',
  { max_new_tokens: 100, do_sample: false }
);

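Constraints:

The first load is a 1-2 GB download, and the user has to sit through it
Models with 3-7B parameters can run on a mid-range laptop
Inference speed: 10-30 tokens/s (slower than the cloud, but usable)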

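3.3 Streaming generation

Streaming output is what makes the experience feel responsive.

import { TextStreamer } from '@huggingface/transformers';

// `generator` is the text-generation pipeline from 3.2; it exposes its tokenizer
const streamer = new TextStreamer(generator.tokenizer, {
  skip_prompt: true,
  // Called with each newly decoded piece of text
  callback_function: (text) => {
    appendToUI(text);
  }
});

await generator(prompt, {
  max_new_tokens: 500,
  streamer
});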

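3.4 Choosing the right model

Model selection for browser Edge AI (measured in 2026):

Task	Recommended model	Size	Speed
Text classification	DistilBERT	250MB	Very fast
Named entity recognition	Xenova/bert-base-NER	400MB	Fast
General chat	Phi-3-mini	2.4GB (q4)	Medium
Code completion	StarCoder2-3B	1.5GB (q4)	Medium
Multilingual	Llama-3.2-3B	1.8GB (q4)	Medium
Vision understanding	Florence-2-base	800MB	Medium
Speech to text	Whisper-tiny	75MB	Fast
Embeddings	all-MiniLM-L6-v2	90MB	Very fast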

3.5 Performance tuning

WebGPU is a must:

// Detect support
if (!navigator.gpu) {
  console.warn('No WebGPU, will be slow');
}
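The presence of `navigator.gpu` does not guarantee a usable GPU: `requestAdapter()` can still resolve to `null`. The sketch below turns the check into a device choice, assuming you fall back to Transformers.js's `'wasm'` backend. `pickDevice` and the two small interfaces are illustrative names, declared locally so the example type-checks without WebGPU typings.

```typescript
// Minimal shape of the WebGPU surface this sketch touches (assumed, not the full API).
interface GpuLike {
  requestAdapter(): Promise<object | null>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

type InferenceDevice = 'webgpu' | 'wasm';

// Returns 'webgpu' only when an adapter is actually available;
// otherwise falls back to the slower WASM (CPU) backend.
async function pickDevice(nav: NavigatorLike): Promise<InferenceDevice> {
  if (!nav.gpu) return 'wasm';
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? 'webgpu' : 'wasm';
  } catch {
    // Some browsers expose navigator.gpu but reject on blocklisted hardware.
    return 'wasm';
  }
}
```

In a page you would call `pickDevice(navigator)` once and pass the result as the `device` option when creating the pipeline.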

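Quantization:

// For a model with about 1B parameters:
// fp16 (unquantized): ~2 GB
// q8: ~1 GB, negligible quality loss
// q4: ~500 MB, slight quality loss
// q4 is the sweet spot for the browser
{ dtype: 'q4' }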
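The sizes above follow from simple arithmetic: weight bytes are roughly parameter count times bits per weight, divided by 8. A small sketch of that estimate follows; `estimateWeightBytes` is an illustrative helper, and it ignores quantization scales, layers kept at higher precision, and file overhead, so real files run somewhat larger.

```typescript
type Dtype = 'fp32' | 'fp16' | 'q8' | 'q4';

const BITS_PER_WEIGHT: Record<Dtype, number> = { fp32: 32, fp16: 16, q8: 8, q4: 4 };

// Approximate size of the weights alone: parameters × bits per weight / 8.
function estimateWeightBytes(paramCount: number, dtype: Dtype): number {
  return (paramCount * BITS_PER_WEIGHT[dtype]) / 8;
}

// Example: a 3.8B-parameter model such as Phi-3-mini at each precision.
const dtypes: Dtype[] = ['fp32', 'fp16', 'q8', 'q4'];
for (const dtype of dtypes) {
  const gigabytes = estimateWeightBytes(3.8e9, dtype) / 1e9;
  console.log(`${dtype}: ~${gigabytes.toFixed(1)} GB`);
}
```

For 3.8B parameters this gives about 15.2, 7.6, 3.8, and 1.9 GB. The 2.4 GB listed for Phi-3-mini q4 in 3.4 is higher than 1.9 GB because of the overhead the estimate leaves out.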

Worker isolation:

// The main thread handles the UI; the model runs in a Web Worker
const worker = new Worker('./ai-worker.js', { type: 'module' });
worker.postMessage({ type: 'generate', prompt });
worker.onmessage = (e) => {
  if (e.data.type === 'token') appendToUI(e.data.token);
};

This keeps the UI from freezing.
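The page-side code above needs a counterpart inside `ai-worker.js`. The following is one possible shape for it, not a prescribed one: the message types beyond `generate` and `token`, and the `createHandler` helper, are assumptions made for this sketch. The generator and the post function are injected so the handler can be exercised without a real model.

```typescript
// Message shapes shared by the page and the worker (illustrative names).
type WorkerRequest = { type: 'generate'; prompt: string };
type WorkerResponse =
  | { type: 'token'; token: string }
  | { type: 'done' }
  | { type: 'error'; message: string };

// A streaming generator: calls onToken for each piece of text as it is produced.
type StreamingGenerate = (prompt: string, onToken: (token: string) => void) => Promise<void>;

// Builds the worker's message handler. `post` sends results back to the page.
function createHandler(generate: StreamingGenerate, post: (msg: WorkerResponse) => void) {
  return async (request: WorkerRequest): Promise<void> => {
    try {
      await generate(request.prompt, (token) => post({ type: 'token', token }));
      post({ type: 'done' });
    } catch (err) {
      // Report failures instead of leaving the page waiting forever.
      post({ type: 'error', message: err instanceof Error ? err.message : String(err) });
    }
  };
}
```

Inside the worker you would wire it up with something like `self.onmessage = (e) => handle(e.data)`, where `handle` comes from `createHandler` and `post` wraps `self.postMessage`.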

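3.6 Worked example: RAG inside the browser

Fully local RAG; the data never leaves the browser:

// 1. Embedding model (local)
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
// Mean-pool and normalize so each text becomes one plain vector
const embed = async (text) =>
  Array.from((await embedder(text, { pooling: 'mean', normalize: true })).data);

// 2. Vector index (IndexedDB)
// 'browser-vector-db' stands in for whichever in-browser vector store you use; its API here is assumed
import { VectorStore } from 'browser-vector-db';
const store = new VectorStore();

// 3. Index the user's local documents
for (const doc of userLocalDocs) {
  await store.add(doc.id, await embed(doc.content), doc.content);
}

// 4. Retrieve
const relevant = await store.search(await embed(userQuestion), { topK: 3 });

// 5. Generate the answer with a local LLM
const llm = await pipeline('text-generation', 'Xenova/Phi-3-mini-4k-instruct', { device: 'webgpu', dtype: 'q4' });
const answer = await llm(`Context: ${relevant.map(r => r.content).join('\n')}\n\nQuestion: ${userQuestion}`);

The whole pipeline stays in the browser. Good fit: private notes, personal knowledge bases, and internal company documents that cannot go to the cloud.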
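The vector store in step 2 does not need to be a dependency for small collections. As a rough illustration of what it does, here is a minimal in-memory store that ranks entries by cosine similarity. `InMemoryVectorStore` is a hypothetical class for this sketch; it keeps everything in RAM and scans linearly, so it suits hundreds or a few thousand documents, not more.

```typescript
interface Entry { id: string; vector: number[]; content: string }
interface Hit { id: string; content: string; score: number }

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vector length mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class InMemoryVectorStore {
  private entries: Entry[] = [];

  add(id: string, vector: number[], content: string): void {
    this.entries.push({ id, vector, content });
  }

  // Scores every entry against the query and returns the best topK, highest first.
  search(query: number[], topK = 3): Hit[] {
    return this.entries
      .map((e) => ({ id: e.id, content: e.content, score: cosineSimilarity(query, e.vector) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, topK);
  }
}
```

To survive a page reload, the entries would still need to be written to IndexedDB and reloaded on startup; that part is omitted here.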

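Chapter 4: Edge AI on mobile

4.1 iOS: Apple Intelligence

Announced in June 2024 and shipped with iOS 18.1 in October 2024. From iOS 26, the Foundation Models framework opens the on-device model to developers.

// Calling Apple Intelligence from Swift
import FoundationModels  // iOS 26 SDK

let session = LanguageModelSession()
let response = try await session.respond(
  to: "Summarize this email: \(emailText)"
)
// The generated text is in response.content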

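Characteristics:

Fully on-device for sensitive tasks, with Private Cloud Compute for complex ones
Invisible to the user (system level)
Free (no API charges)
Strong privacy (Apple does not store the data)

Good fit:

iOS apps that want AI without building their own stack
Strong privacy requirements
Tasks that do not need top-tier quality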

4.2 Android: Gemini Nano

First shipped on the Pixel 8 Pro in December 2023; by 2026 it is supported on mainstream Android devices.

// Android: through AICore
import com.google.ai.edge.aicore.GenerativeModel

val model = GenerativeModel(
  generationConfig {
    context = applicationContext
    temperature = 0.7f
    topK = 40
    maxOutputTokens = 200
  }
)

val response = model.generateContent("Summarize: $text")

Characteristics:

Native support on Pixel 8 and later and on high-end flagships
A system-level API shared across apps
Works offline
Free

4.3 Cross-platform: ONNX Runtime Mobile

If you do not want to be locked into the separate iOS and Android SDKs:

// React Native with onnxruntime-react-native
import { InferenceSession } from 'onnxruntime-react-native';

const session = await InferenceSession.create(modelPath);
const results = await session.run(inputs);

The cost: you lose system integration and have to manage the model files yourself.

4.4 Choosing a stack

Scenario	iOS	Android
General text tasks	Apple Intelligence	Gemini Nano
Vision tasks	Vision API + Core ML	ML Kit
Custom models	Core ML (converted)	TensorFlow Lite / ONNX
Cross-platform + custom	React Native + ONNX Runtime	Same

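Chapter 5: Desktop

5.1 Ollama (the simplest)

The fastest way to run a local model:

# Install
brew install ollama

# Start the HTTP server (listens on localhost:11434; leave it running)
ollama serve

# In another terminal, pull a model
ollama pull llama3.2:3b

// Your application calls the local HTTP API
const response = await fetch('http://localhost:11434/api/generate', {
  method: 'POST',
  body: JSON.stringify({
    model: 'llama3.2:3b',
    prompt: 'Hello',
    stream: false  // return one JSON object instead of a stream of chunks
  })
});
const { response: text } = await response.json();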

Good fit:

Personal development
A local Coding Agent
Privacy-sensitive tools

Poor fit:

Distribution to non-technical users (they would have to install Ollama)

5.2 macOS: native MLX

Apple's own ML framework, optimized for Apple Silicon:

import mlx.core as mx
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)

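Characteristics:

Reported to be 30-50% faster than llama.cpp on M-series chips, though the gap varies by model and version
Unified memory (no copies between CPU and GPU)
Built for the Apple ecosystem first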

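5.3 Windows: Copilot Runtime

Windows 11 includes a built-in AI runtime:

// C# / WinUI
// Illustrative only: the namespace and type names below are placeholders;
// check the current Windows App SDK documentation for the real API surface.
using Microsoft.AI.Generative;

var session = new GenerativeSession();
var response = await session.GenerateAsync("Hello");

It is integrated with Windows ML and can use the NPU. Snapdragon X, Intel Lunar Lake, and AMD Ryzen AI are all supported.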

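5.4 Distributing a desktop AI application

Bundling the model into the app means a multi-gigabyte installer. Three strategies:

A. Download on first launch: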

async function ensureModel() {
  if (!modelExists()) {
    showProgress('Downloading AI model...');
    await downloadModel('https://cdn.../model.gguf');
  }
}

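B. Users bring their own Ollama:

The app detects whether Ollama is installed
If it is not, the app prompts the user to install it
If it is, the app connects to the local Ollama

C. Use what the platform provides (recommended):

macOS: use MLX + the system's Apple Intelligence
Windows: use Windows Copilot Runtime
No need to ship your own model when the system model is enough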

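Chapter 6: Hybrid cloud/local routing

See 17 Multi-Model Routing. Edge AI adds more dimensions to the routing decision.

6.1 Decision dimensions

function shouldRunLocally(task: Task): boolean {
  // Strong privacy requirement
  if (task.containsSensitiveData) return true;
  
  // Offline
  if (!isOnline()) return true;
  
  // Simple task (local can handle it)
  if (task.complexity < 0.5) return true;
  
  // Pricing tier (cloud costs more, so free users stay local)
  if (task.user.tier === 'free') return true;
  
  // Tight latency budget (milliseconds)
  if (task.maxLatency < 200) return true;
  
  return false;
}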
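A boolean is hard to debug once routing misbehaves in production. One possible variation, sketched below, returns the reason alongside the decision so it can be logged. The `Task` shape and the reason strings are assumptions for this example; the checks and their order mirror the function above.

```typescript
// Hypothetical task shape; the fields mirror the checks in this section.
interface Task {
  containsSensitiveData: boolean;
  complexity: number;        // 0 (trivial) to 1 (hard)
  userTier: 'free' | 'paid';
  maxLatencyMs: number;
}

interface RouteDecision {
  target: 'local' | 'cloud';
  reason: string;
}

// Same checks in the same priority order, but each decision carries its reason.
function routeTask(task: Task, online: boolean): RouteDecision {
  if (task.containsSensitiveData) return { target: 'local', reason: 'sensitive-data' };
  if (!online) return { target: 'local', reason: 'offline' };
  if (task.complexity < 0.5) return { target: 'local', reason: 'simple-task' };
  if (task.userTier === 'free') return { target: 'local', reason: 'free-tier' };
  if (task.maxLatencyMs < 200) return { target: 'local', reason: 'latency-budget' };
  return { target: 'cloud', reason: 'default' };
}
```

Because the first matching rule wins, the order encodes priority: a sensitive task stays local even for a paid user with a generous latency budget.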

6.2 Architecture in practice

class HybridAgent {
  async chat(message: string, context: Context) {
    const intent = await this.classifyLocally(message);  // local classification (very fast)
    
    if (intent.needsCloud) {
      return await this.cloudLLM(message);  // complex tasks go to the cloud
    }
    
    return await this.localLLM(message);  // simple tasks stay local
  }
  
  // Even for cloud tasks, preprocess locally
  async cloudLLM(message: string) {
    // 1. Redact sensitive data locally
    const sanitized = await this.localPII.redact(message);
    
    // 2. Embed locally (the original text is not uploaded)
    const embedding = await this.localEmbed(sanitized);
    
    // 3. Server-side retrieval (using the embedding)
    const docs = await this.serverVectorSearch(embedding);
    
    // 4. Generate in the cloud
    const response = await this.cloudGenerate(sanitized, docs);
    
    // 5. Restore the redacted fields locally
    return this.localPII.restore(response);
  }
}
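The redact and restore steps (1 and 5) are what keep raw identifiers off the wire. A deliberately narrow sketch follows: it handles only email addresses with a single regular expression, and the `redact`/`restore` names and the `[EMAIL_n]` placeholder format are assumptions for this example. Real PII detection needs far broader coverage (names, phone numbers, IDs), typically from a local NER model.

```typescript
interface Redaction {
  text: string;                  // input with placeholders substituted
  mapping: Map<string, string>;  // placeholder -> original value, kept on the device
}

const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Replaces each email address with a numbered placeholder and remembers the original.
function redact(input: string): Redaction {
  const mapping = new Map<string, string>();
  let counter = 0;
  const text = input.replace(EMAIL_PATTERN, (match) => {
    const placeholder = `[EMAIL_${counter++}]`;
    mapping.set(placeholder, match);
    return placeholder;
  });
  return { text, mapping };
}

// Puts the original values back wherever the cloud response echoes a placeholder.
function restore(output: string, mapping: Map<string, string>): string {
  let result = output;
  mapping.forEach((original, placeholder) => {
    result = result.split(placeholder).join(original);
  });
  return result;
}
```

The mapping never leaves the device, so the cloud model only ever sees the placeholders.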

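6.3 Apple's Private Cloud Compute model

A design Apple introduced in 2024:

User request
  ↓
Device decides: can this be handled locally?
  ├─ Yes → finish locally
  └─ No → send to Private Cloud Compute
            ↓
        Apple servers (no data retention, auditable)
            ↓
        Return the result

Worth borrowing:

Local first (the default)
Go to the cloud only when necessary (with explicit consent)
No server-side storage (compliance)
Auditable (publicly inspectable software images)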

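6.4 Split deployment

The whole model does not have to live on the device. You can split it:

Embedding (local) + vector store (cloud)
  ↓ The user's documents stay off the cloud, yet a cloud knowledge base is still usable

Tokenizer (local) + model (cloud)
  ↓ Less data on the wire (but the tokens still go to the cloud)

Small LLM drafts (local) + large LLM refines (cloud)
  ↓ Generator-Critic across cloud and device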

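Chapter 7: Performance and experience

7.1 Load time

The first time users open an AI feature, they have to wait:

First load (Phi-3-mini q4, ~2.4 GB):
- Download at 20 Mbps (typical mobile): about 16 minutes
- Download at 100 Mbps (good WiFi): about 3 minutes
- WebGPU compilation: 3-5 seconds

Later loads from cache: a few seconds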
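First-load time is dominated by the download, which is plain arithmetic on file size and bandwidth: seconds = bytes × 8 / bits per second. The helper below is an illustrative sketch of that estimate; it uses the 2.4 GB q4 Phi-3-mini size from the table in 3.4 and assumes the link sustains its nominal rate, which real networks rarely do.

```typescript
// Ideal transfer time in seconds for a file of sizeBytes over a link of linkMbps (megabits/s).
function estimateDownloadSeconds(sizeBytes: number, linkMbps: number): number {
  if (linkMbps <= 0) throw new RangeError('linkMbps must be positive');
  return (sizeBytes * 8) / (linkMbps * 1e6);
}

const modelBytes = 2.4e9; // Phi-3-mini q4, per the table in 3.4
for (const mbps of [20, 100, 500]) {
  const seconds = estimateDownloadSeconds(modelBytes, mbps);
  console.log(`${mbps} Mbps: about ${Math.round(seconds)} s`);
}
```

This works out to about 960 s at 20 Mbps, 192 s at 100 Mbps, and 38 s at 500 Mbps, which is why the first-load UX needs a real progress indicator and not a spinner.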

UX design:

import { useEffect, useState } from 'react';

function AIFeature() {
  const [stage, setStage] = useState<'downloading' | 'compiling' | 'ready'>('downloading');
  const [progress, setProgress] = useState(0);  // download progress for the bar
  
  useEffect(() => {
    loadModel({
      onDownload: (progress) => setProgress(progress),
      onCompile: () => setStage('compiling'),
      onReady: () => setStage('ready'),
    });
  }, []);
  
  return (
    <>
      {stage === 'downloading' && (
        <ProgressBar progress={progress} message="Loading the AI model for the first time (~2.4 GB); later launches take seconds" />
      )}
      {stage === 'compiling' && <Loading message="Compiling..." />}
      {stage === 'ready' && <ChatUI />}
    </>
  );
}

Tell users plainly: slow the first time, fast afterwards.

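7.2 Inference speed

Device	Model	Speed
M3 MacBook Pro	Phi-3-mini q4	40-60 t/s
Pixel 9	Gemini Nano	30-40 t/s
Mid-range laptop (i5 + integrated graphics)	Phi-3-mini q4	8-15 t/s
Old laptop	Not recommended	< 5 t/s (unusable)

Fallback strategy:

const speed = await benchmarkLocal();  // tokens per second
if (speed < 10) {
  // The device is too weak; steer the user to the cloud version
  fallbackToCloud();
}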
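`benchmarkLocal` is left undefined above. One way it might look: time a short generation and divide tokens produced by elapsed seconds. In this sketch the generator and the clock are injected, and the names `benchmarkTokensPerSecond` and `chooseBackend` are hypothetical; the 10 t/s threshold is the one used in the snippet above.

```typescript
// Runs a generation and resolves to the number of tokens actually produced.
type CountingGenerate = (prompt: string, maxNewTokens: number) => Promise<number>;

// Measures throughput in tokens per second over one short generation.
async function benchmarkTokensPerSecond(
  generate: CountingGenerate,
  now: () => number = () => Date.now(),
  maxNewTokens = 32
): Promise<number> {
  const startMs = now();
  const produced = await generate('Hello', maxNewTokens);
  const elapsedMs = now() - startMs;
  if (elapsedMs <= 0) return Infinity; // clock too coarse to measure
  return produced / (elapsedMs / 1000);
}

// Picks the backend from the measured speed.
function chooseBackend(tokensPerSecond: number, minTokensPerSecond = 10): 'local' | 'cloud' {
  return tokensPerSecond >= minTokensPerSecond ? 'local' : 'cloud';
}
```

Run the benchmark after a warm-up generation, since the first call often includes shader compilation and would understate steady-state speed.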

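7.3 Memory

Resident memory while the model is loaded:

Phi-3-mini q4: 2.5 GB of RAM
Llama-3.2-3B q4: 2 GB
Gemini Nano: managed by the system

Strategy:

Unload when unused (after 5 minutes idle)
Load again when the user enters the AI feature
Share one instance across tabs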
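The "unload after 5 minutes idle, reload on demand" strategy can be sketched as a small holder object. `IdleModelHolder` is a hypothetical name, and the load and dispose callbacks are whatever your runtime provides. Time is passed in explicitly to keep the logic testable; in an app you would pass `Date.now()` and call `sweep` from a timer.

```typescript
// Holds at most one loaded model and releases it after a period of inactivity.
class IdleModelHolder<T> {
  private model: T | null = null;
  private lastUsedMs = 0;
  private readonly load: () => T;
  private readonly dispose: (model: T) => void;
  private readonly idleLimitMs: number;

  constructor(load: () => T, dispose: (model: T) => void, idleLimitMs = 5 * 60 * 1000) {
    this.load = load;
    this.dispose = dispose;
    this.idleLimitMs = idleLimitMs;
  }

  // Returns the model, loading it on demand, and records the access time.
  acquire(nowMs: number): T {
    if (this.model === null) this.model = this.load();
    this.lastUsedMs = nowMs;
    return this.model;
  }

  // Unloads the model if it has been idle for the limit; returns true if it did.
  sweep(nowMs: number): boolean {
    if (this.model !== null && nowMs - this.lastUsedMs >= this.idleLimitMs) {
      this.dispose(this.model);
      this.model = null;
      return true;
    }
    return false;
  }
}
```

Since model loading is asynchronous in practice, `T` would usually be a promise of the pipeline, so that concurrent callers share one in-flight load.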

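Chapter 8: Privacy and security

8.1 The privacy advantage of Edge AI

When data truly never leaves the device, privacy is real and not just a policy promise.

Good fit:

Health data (heart rate, symptoms)
Financial data (bills, investments)
Communications (email, private chats)
Personal notes
Sensitive corporate documents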

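8.2 What still needs attention

1. The model itself can leak:

Research has shown that:
- Models "memorize" parts of their training data
- Attackers may be able to infer training content from a model
- Prefer models published by well-known organizations over unknown sources

2. Cache leakage:

Browser IndexedDB is not absolutely secure
- Several users may share the same device
- A browser vulnerability could expose it
- Encrypt sensitive data before storing it

3. Inference logs:

// Bad: inference is local, but the log goes to the cloud
async function localChatLeaky(message: string) {
  const response = await localLLM(message);
  await fetch('/api/log', { method: 'POST', body: JSON.stringify({ message, response }) });  // ❌ privacy defeated
  return response;
}

// Good: local inference, local log
async function localChat(message: string) {
  const response = await localLLM(message);
  await localDB.log({ message, response });
  return response;
}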

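8.3 Model integrity verification

Verify integrity when downloading a model:

// Hex-encoded SHA-256 using the Web Crypto API
async function sha256(buffer: ArrayBuffer): Promise<string> {
  const digest = await crypto.subtle.digest('SHA-256', buffer);
  return Array.from(new Uint8Array(digest), (b) => b.toString(16).padStart(2, '0')).join('');
}

async function downloadModel(url: string, expectedHash: string) {
  const response = await fetch(url);
  if (!response.ok) throw new Error(`Download failed: ${response.status}`);
  const buffer = await response.arrayBuffer();
  
  const hash = await sha256(buffer);
  if (hash !== expectedHash.toLowerCase()) {
    throw new Error('Model integrity check failed');
  }
  
  return buffer;
}

This guards against a swapped file on the CDN or a man-in-the-middle attack, provided the expected hash ships with your app and is not fetched from the same CDN.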

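Chapter 9: Case studies

9.1 Case: code completion in the browser

// Like Copilot, but fully local
import { pipeline } from '@huggingface/transformers';

const completer = await pipeline(
  'text-generation',
  'Xenova/starcoder2-3b',
  { device: 'webgpu', dtype: 'q4' }
);

// Editor integration (debounce this in practice; it fires on every keystroke)
editor.on('input', async (context) => {
  const completion = await completer(
    context.code,
    { max_new_tokens: 30, do_sample: false }
  );
  // The pipeline returns an array; generated_text includes the prompt, so strip it
  showInlineSuggestion(completion[0].generated_text.slice(context.code.length));
});

Advantages:

Code is never uploaded (IP protection)
Zero API cost
Works offline

Disadvantages:

Quality slightly below Copilot
Slow first load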

9.2 Case: AI search over local notes

// Add AI search to a notes app while keeping all data local
class LocalNotesAI {
  async indexNote(note: Note) {
    const embedding = await this.embedder(note.content);
    await this.localVectorDB.add(note.id, embedding);
  }
  
  async semanticSearch(query: string) {
    const queryEmbed = await this.embedder(query);
    return await this.localVectorDB.search(queryEmbed);
  }
  
  async askQuestion(question: string) {
    const relevant = await this.semanticSearch(question);
    const context = relevant.map(r => this.getNote(r.id).content).join('\n');
    return await this.localLLM(`Context: ${context}\n\nQuestion: ${question}`);
  }
}

A good fit for products like Obsidian and Notion.

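9.3 Case: real-time translation on mobile

// iOS: Apple's Translation framework + Apple Intelligence
import Translation
import FoundationModels

// `translationSession` is supplied by SwiftUI's .translationTask modifier
func translateAndImprove(_ text: String, using translationSession: TranslationSession) async throws -> String {
  // 1. Standard translation (stays on the device)
  let translation = try await translationSession.translate(text).targetText
  
  // 2. AI polish (also stays on the device)
  let session = LanguageModelSession()
  let polished = try await session.respond(to: "Make this translation natural: \(translation)")
  
  return polished.content
}

Works in the subway and on a plane.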

9.4 Case: hybrid customer service

class HybridCustomerService {
  async chat(message: string, user: User) {
    // 1. Classify intent locally
    const intent = await this.localClassifier(message);
    
    if (intent.type === 'faq') {
      // 2a. FAQ: local retrieval + local generation
      const docs = await this.localKB.search(message);
      return await this.localLLM(`Answer based on: ${docs}`);
    }
    
    if (intent.type === 'order_query') {
      // 2b. Needs back-end data: cloud
      const order = await this.cloudAPI.getOrder(user.id);
      return await this.cloudLLM(message, order);  // user data goes to the cloud here
    }
    
    if (intent.type === 'complaint') {
      // 2c. Emotionally complex: large cloud model
      return await this.cloudLLM(message, { model: 'opus' });
    }
    
    // Anything else: default to local
    return await this.localLLM(message);
  }
}

The aim is for about 90% of traffic to be handled locally (free, fast, private) and 10% to go to the cloud (higher quality).

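Chapter 10: Pitfalls

10.1 Engineering

Pitfall	Consequence	Fix
No WebGPU detection	Inference too slow to use	Always check, with a fallback
Model too large	Users give up and leave	Use q4 quantization + warn above 2 GB
Running the model on the main thread	UI freezes	Worker isolation
No cache	Re-download every time	Persist in the Cache API or IndexedDB
Each tab loads its own copy	Memory blows up	SharedWorker
No handling for inference timeouts	Users get stuck	Timeout + message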
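The "timeout + message" fix in the last row can be sketched as a small promise wrapper. `withTimeout` is an illustrative helper, not part of any library mentioned here. Note that it only stops waiting: the underlying inference keeps running unless your runtime offers a way to cancel it.

```typescript
// Rejects if `work` has not settled within `ms` milliseconds; otherwise passes its result through.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      }
    );
  });
}
```

A caller would wrap the generation call, catch the timeout error, and show the user a message with the option to retry or switch to the cloud.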

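10.2 User experience

Pitfall	Symptom	Fix
Starting the load on first entry with no explanation	Users are confused	State clearly: "The AI model needs to be downloaded on first use"
No streaming	Wait 5 seconds, then everything appears at once	Streaming output (see 12)
Quality below cloud without saying so	Users think the product is poor	Label it: "Local mode: fast, slightly lower quality"
Replacing the cloud entirely	Complex tasks fall apart	Hybrid architecture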

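10.3 Security

Pitfall	Consequence	Fix
Models from unknown sources	Possible poisoning	Use only verified Hugging Face repositories
No hash check	Model swapped out	SHA-256 verification
Uploading inference results to logs	Privacy defeated	Local logs
Unencrypted model cache	Readable by other users of the device	Encrypt before writing to IndexedDB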

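Chapter 11: Future directions

11.1 Browser-native AI APIs

Chrome has been testing a built-in Prompt API (first exposed as window.ai, since renamed to a global LanguageModel object):

// Browser-native inference; the API surface is still changing
const session = await LanguageModel.create();
const response = await session.prompt('Hello');

There is no model to load yourself; the browser ships a small built-in model such as Gemini Nano. Expected to go mainstream in 2026-2027.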

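11.2 Model sizes keep shrinking

2024: 3B is the browser's limit
2025: 1B models match the quality of 2023's 3B models
2026: 500M models approach GPT-4 on some narrow tasks

Edge AI's coverage keeps widening.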

11.3 NPUs everywhere

New devices ship with an NPU (neural processing unit):

Apple M / A series
Snapdragon X
Intel Core Ultra
AMD Ryzen AI

Browsers reach the NPU through WebNN, which cuts energy use sharply.

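11.4 Standardized protocols

Something like MCP, an "Edge AI Protocol", may emerge:

A unified interface for calling local models
Cross-vendor compatibility
Applications that need not care about what is underneath

Chrome's window.ai (now the Prompt API) points in this direction.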

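Chapter 12: Action checklist

If you want to adopt Edge AI:

Web application:

Install @huggingface/transformers
Try pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2') (lightweight)
Test one embedding or classification task
Add WebGPU detection
Ship it behind a feature flag with a gradual rollout

iOS application:

Upgrade to the iOS 26 SDK
Try the FoundationModels API
Pick a feature that does not need the cloud (summarization, translation)
Add it to the app

Desktop application:

Install Ollama and test
Choose a suitable model
Integrate (HTTP calls)
Decide on a packaging strategy

Before you commit, evaluate:

The hardware distribution of your users' devices
Which tasks truly need to run locally (privacy / offline / low latency)
Whether the team can maintain an additional local inference stack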

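Authoritative references

Transformers.js documentation
WebGPU specification
ONNX Runtime Web
Apple Intelligence developer documentation
Gemini Nano (AICore)
Ollama
MLX
llama.cpp
Run LLMs on macOS using llm-mlx (Simon Willison, 2025-02)
6-5 Private Deployment of Open-Source Models (prerequisite)
17 Multi-Model Routing and Cost Optimization
12 Streaming Engineering in Depth

Last verified: 2026-06-12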
